Add new script hocr-cut for cutting a page #108

Merged
merged 6 commits into from
Sep 7, 2018

Conversation

Collaborator

@zuphilip zuphilip commented Mar 18, 2017

This cuts a page (horizontally) into two pages in the middle
such that most of the bounding boxes are separated cleanly,
e.g. for cutting double pages or double columns.

For example, this double page

litver

is cut in the middle, and the script outputs a left and a right page.

The whole computation is based on the bounding boxes and therefore needs the output of some OCR or layout segmentation process as input. It might still be worthwhile to OCR the individual pages again afterwards to get better results (e.g. skew tends to be more consistent within a single page than across a double page).
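As a rough illustration of the idea described above, a cut position can be chosen from the bounding boxes alone. This is a minimal sketch and not necessarily how `hocr-cut` does it: the function names, the `window` parameter, and the tie-breaking rule are assumptions made for this example.

```python
def count_crossings(bboxes, x):
    """Count the boxes that a vertical line at position x would cut through."""
    return sum(1 for (x0, _y0, x1, _y1) in bboxes if x0 < x < x1)


def best_cut(bboxes, page_width, window=0.1):
    """Pick an x position near the page middle that crosses the fewest boxes.

    `window` (hypothetical) is the fraction of the page width searched on
    either side of the middle.
    """
    middle = page_width // 2
    half = int(page_width * window)
    candidates = range(middle - half, middle + half + 1)
    # Prefer few crossings; break ties by closeness to the middle.
    return min(
        candidates,
        key=lambda x: (count_crossings(bboxes, x), abs(x - middle)),
    )
```

With boxes given as `(x0, y0, x1, y1)` tuples, a gutter between two text blocks yields a cut that touches no box; if the gutter is off-center, the cut moves to the nearest clean position within the search window.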

@zuphilip zuphilip changed the title from "Add new script hocr-cut for cutting a pages" to "Add new script hocr-cut for cutting a page" Mar 18, 2017
zuphilip and others added 4 commits September 5, 2018 07:44
This cuts a page (horizontally) into two pages in the middle
such that the most of the bounding boxes are separated nicely,
e.g. cutting double pages or double columns.
It was fixed using `yapf -i --style pep8 hocr-cut`.

Signed-off-by: Stefan Weil <[email protected]>
Tesseract uses image names enclosed in "" which must be stripped
because otherwise opening the image will fail.

Signed-off-by: Stefan Weil <[email protected]>
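The quoting issue mentioned in the commit message above can be sketched as follows. This assumes the usual hOCR `title` layout of semicolon-separated properties; the helper names `get_prop` and `image_name` are hypothetical and not taken from the script.

```python
def get_prop(title, name):
    """Return the value of property `name` from an hOCR title string, or None."""
    for part in title.split(";"):
        part = part.strip()
        if part.startswith(name + " "):
            return part[len(name) + 1:].strip()
    return None


def image_name(title):
    """Return the image file name with surrounding double quotes removed."""
    value = get_prop(title, "image")
    if value is None:
        return None
    # Tesseract writes image "file.png"; the quotes are not part of the name.
    return value.strip('"')
```

For a title such as `image "scan_0001.png"; bbox 0 0 2000 3000; ppageno 0`, `image_name` returns `scan_0001.png`. File names that themselves contain a semicolon are not handled by this sketch.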
Contributor

@kba kba left a comment


IMHO useful to include in master.

@stweil stweil merged commit adb810c into ocropus:master Sep 7, 2018
@stweil stweil deleted the split-pages branch September 7, 2018 14:16
Collaborator

stweil commented Sep 7, 2018

Done. Thank you, Philipp and Konstantin, for the contribution and the review.

@stweil stweil self-assigned this Sep 7, 2018
Collaborator

stweil commented Sep 7, 2018

Should we tag a new release based on master? 1.3.0?

Collaborator

stweil commented Sep 7, 2018

The script could be extended to create two new hOCR files for left and right page, too.
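One possible starting point for such an extension is to assign each bounding box to the left or right page and shift the right-hand coordinates so that they start at zero. This is only a sketch under assumptions: the function name `split_bboxes`, the assignment by box center, and the clipping at the cut are illustrative choices, and writing the actual hOCR files is left out.

```python
def split_bboxes(bboxes, cut_x):
    """Split (x0, y0, x1, y1) boxes into left-page and right-page lists.

    Boxes are assigned by their horizontal center. Boxes that straddle the
    cut are clipped to it, and right-page boxes are shifted left by cut_x.
    """
    left, right = [], []
    for x0, y0, x1, y1 in bboxes:
        center = (x0 + x1) / 2
        if center < cut_x:
            left.append((x0, y0, min(x1, cut_x), y1))
        else:
            right.append((max(x0, cut_x) - cut_x, y0, x1 - cut_x, y1))
    return left, right
```

The two lists could then be written into separate hOCR documents, each with its own page-level `bbox` and `image` properties.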

Collaborator Author

zuphilip commented Sep 7, 2018

A new release sounds good, but there is already one drafted. Sorry, I forgot about this. Maybe we can do two new releases, 1.2.1 and 1.3.0?

Improving the script sounds fine. I also expect that after cutting a double page into two single pages, it might be better to run OCR on each of them again.

Collaborator

stweil commented Sep 7, 2018

Let's start with 1.2.1, then create 1.3.0.

Running OCR again on the single pages is reasonable, but it can cost a lot of resources if many pages have to be processed, so separate hOCR files derived from the initial double pages can be desirable in certain situations.

3 participants